Lecture 8:
Basic Data Manipulation with R
2024-11-21
Long and wide data. Source: Hugo Tavares
Consider the following data frame schwiizerChuchi. This dataset records the popularity ratings (on a scale of 1 to 10) of various Swiss dishes in different regions of Switzerland:
Which of the following statements is true?
- nrow(schwiizerChuchiLong) == 12 returns TRUE
- dim(schwiizerChuchiLong) returns c(3, 12)
- dim(schwiizerChuchi) returns c(3, 12)
- mean(schwiizerChuchiLong$Raclette) == 8.333
Why is this data frame not tidy, and what would you do to make it tidy? Write down your reasoning in numbered steps. You can write down some exact code, some higher-level code concepts, or in plain text.
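As a point of reference for the long/wide distinction, here is a minimal sketch of reshaping a wide table into a long one with tidyr's pivot_longer(). The data frame ratings_wide, its column names, and all values are invented for illustration; they are not the lecture's schwiizerChuchi data.

```r
library(tidyr)

# Hypothetical wide table: one row per region, one column per dish.
# These values are made up and are NOT the lecture's schwiizerChuchi data.
ratings_wide <- data.frame(
  Region   = c("Zurich", "Geneva", "Ticino"),
  Fondue   = c(8, 9, 6),
  Raclette = c(9, 8, 7),
  Roesti   = c(9, 6, 5)
)

# Long format: one row per (Region, Dish) pair, with the rating in its own column
ratings_long <- pivot_longer(
  ratings_wide,
  cols      = -Region,   # every column except Region holds a rating
  names_to  = "Dish",
  values_to = "Rating"
)

dim(ratings_wide)   # 3 rows, 4 columns
dim(ratings_long)   # 9 rows, 3 columns
```

In the long version each variable (region, dish, rating) has its own column, which is the layout most tidyverse functions expect.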
Join setup. Source: R4DS.
Inner join. Source: R4DS
Left join. Source: R4DS.
Right join. Source: R4DS.
Full join. Source: R4DS.
Join Venn diagram. Source: R4DS.
# load packages
library(tidyverse)
# initialize data frame on persons' personal spending
df_c <- data.frame(id = c(1:3, 1:3),
                   money_spent = c(1000, 2000, 6000, 1500, 3000, 5500),
                   currency = c("CHF", "CHF", "USD", "EUR", "CHF", "USD"),
                   year = c(2017, 2017, 2017, 2018, 2018, 2018))
df_c
  id money_spent currency year
1 1 1000 CHF 2017
2 2 2000 CHF 2017
3 3 6000 USD 2017
4 1 1500 EUR 2018
5 2 3000 CHF 2018
6 3 5500 USD 2018
# initialize data frame on persons' characteristics
df_p <- data.frame(id = 1:4,
                   first_name = c("Anna", "Betty", "Claire", "Diane"),
                   profession = c("Economist", "Data Scientist",
                                  "Data Scientist", "Economist"))
df_p
  id first_name profession
1 1 Anna Economist
2 2 Betty Data Scientist
3 3 Claire Data Scientist
4 4 Diane Economist
id first_name profession money_spent currency year
1 1 Anna Economist 1000 CHF 2017
2 1 Anna Economist 1500 EUR 2018
3 2 Betty Data Scientist 2000 CHF 2017
4 2 Betty Data Scientist 3000 CHF 2018
5 3 Claire Data Scientist 6000 USD 2017
6 3 Claire Data Scientist 5500 USD 2018
7 4 Diane Economist NA <NA> NA
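The call that produced the joined table above is not shown in the text. One call that yields this result is a left join of df_p with df_c on id; this is an assumption about what was run, sketched here with both data frames re-created so the block runs on its own.

```r
library(dplyr)

# Re-create the two data frames from the lecture
df_c <- data.frame(id = c(1:3, 1:3),
                   money_spent = c(1000, 2000, 6000, 1500, 3000, 5500),
                   currency = c("CHF", "CHF", "USD", "EUR", "CHF", "USD"),
                   year = c(2017, 2017, 2017, 2018, 2018, 2018))
df_p <- data.frame(id = 1:4,
                   first_name = c("Anna", "Betty", "Claire", "Diane"),
                   profession = c("Economist", "Data Scientist",
                                  "Data Scientist", "Economist"))

# Keep every person in df_p; spending columns are NA where df_c has no match
df_merged <- left_join(df_p, df_c, by = "id")
df_merged
```

Diane (id 4) has no rows in df_c, so she appears once with NA in money_spent, currency, and year, while every other person appears once per year of spending.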
Overview by R4DS:
| dplyr (tidyverse) | base::merge |
|---|---|
| inner_join(x, y) | merge(x, y) |
| left_join(x, y) | merge(x, y, all.x = TRUE) |
| right_join(x, y) | merge(x, y, all.y = TRUE) |
| full_join(x, y) | merge(x, y, all = TRUE) |
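The correspondence in the table can be checked on a small example. The sketch below uses two toy tables, x and y, with a join column named key; these tables and column names are hypothetical and chosen only so that each join type returns a different set of rows.

```r
library(dplyr)

# Hypothetical toy tables (not from the lecture); `key` is the join column
x <- data.frame(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- data.frame(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# Number of rows returned by each dplyr join
join_rows <- c(
  inner = nrow(inner_join(x, y, by = "key")),
  left  = nrow(left_join(x, y, by = "key")),
  right = nrow(right_join(x, y, by = "key")),
  full  = nrow(full_join(x, y, by = "key"))
)

# Number of rows returned by the matching base::merge call
merge_rows <- c(
  inner = nrow(merge(x, y, by = "key")),
  left  = nrow(merge(x, y, by = "key", all.x = TRUE)),
  right = nrow(merge(x, y, by = "key", all.y = TRUE)),
  full  = nrow(merge(x, y, by = "key", all = TRUE))
)

join_rows    # inner 2, left 3, right 3, full 4
merge_rows   # same counts
```

The two approaches return the same rows, but not necessarily in the same order: merge() sorts the result by the key column by default, whereas the dplyr joins keep the row order of x.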
Source: https://www.storybench.org/wp-content/uploads/2017/05/tidyverse.png
select, filter, arrange, mutate are the building blocks of dplyr.
The pipe operator |> chains them into a pipeline.
Pipeline with dplyr
# Traditional way
# data(swiss) returns only the name "swiss", so assign the data frame itself
mydf <- swiss
mydf <- arrange(mydf, -Catholic)
mydf <- filter(mydf, Education > 8 & Catholic > 90)
mydf <- mutate(mydf, Country = "Switzerland")
mydf <- select(mydf, Examination)
# The pipe way
mydf <- swiss |>
arrange(-Catholic) |>
filter(Education > 8 & Catholic > 90) |>
mutate(Country = "Switzerland") |>
select(Examination)
Tidyverse packages besides dplyr:
- forcats to deal with factors;
- lubridate to deal with dates;
- stringr to deal with strings and regular expressions.
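To give a flavour of these three packages, here is a minimal sketch with one or two calls from each. The input vectors are invented for illustration, and the functions shown are only a small sample of what each package offers.

```r
library(forcats)
library(lubridate)
library(stringr)

# forcats: reorder factor levels by how often each level occurs
dishes <- factor(c("Fondue", "Raclette", "Fondue", "Roesti", "Fondue", "Raclette"))
dish_levels <- levels(fct_infreq(dishes))
dish_levels   # "Fondue" "Raclette" "Roesti"

# lubridate: parse a date string and extract components
lecture_date <- ymd("2024-11-21")
lecture_year <- year(lecture_date)     # 2024
lecture_month <- month(lecture_date)   # 11

# stringr: detect and extract patterns with regular expressions
spending <- c("CHF 1000", "USD 6000", "EUR 1500")
is_chf <- str_detect(spending, "^CHF")                # TRUE FALSE FALSE
amounts <- as.numeric(str_extract(spending, "[0-9]+"))
amounts       # 1000 6000 1500
```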